Abstract 

Community detection is a fundamental problem in network analysis, with applications in 
many diverse areas. The stochastic block model is a common tool for model-based community 
detection, and asymptotic tools for checking consistency of community detection under the 
block model have been recently developed [5]. However, the block model is limited by its 
assumption that all nodes within a community are stochastically equivalent, and provides a 
poor fit to networks with hubs or highly varying node degrees within communities, which are 
common in practice. The degree-corrected stochastic block model [20] was proposed to address 
this shortcoming, and allows variation in node degrees within a community while preserving 
the overall block community structure. In this paper, we establish general theory for checking 
consistency of community detection under the degree-corrected stochastic block model, and 
compare several community detection criteria under both the standard and the degree-corrected 
models. We show which criteria are consistent under which models and constraints, as well as 
compare their relative performance in practice. We find that methods based on the degree- 
corrected block model, which includes the standard block model as a special case, are consistent 
under a wider class of models; and that modularity-type methods require parameter constraints 
for consistency, whereas likelihood-based methods do not. On the other hand, in practice the 
degree correction involves estimating many more parameters, and empirically we find it is only 
worth doing if the node degrees within communities are indeed highly variable. We illustrate 
the methods on simulated networks and on a network of political blogs. 



1 Introduction 



Networks have become one of the more common forms of data, and network analysis has received 
a lot of attention in computer science, physics, social sciences, biology, and statistics (see [13], [15], 
[25] for reviews). The applications are many and varied, including social networks [38], [31], gene 
regulatory networks [33], recommender systems, and security monitoring. One of the fundamental 
problems in network analysis is community detection, where communities are groups of nodes 
that are, in some sense, more similar to each other than to other nodes. The precise definition 
of community, like that of a cluster in multivariate analysis, is difficult to formalize, but many 
methods have been developed to address this problem (see [15], [?] for comprehensive recent 
reviews), often relying on the intuitive notion of community as a group of nodes with many links 
between themselves and fewer links to the rest of the network. 

Three groups of methods for community detection can be loosely identified in the literature. A 
number of greedy algorithms such as hierarchical clustering have been proposed (see [23] for a 
review), which we will not focus on in this paper. The second class of methods involves optimization 
of some "reasonable" global criteria over all possible network partitions and includes graph cuts 
[39], [34], spectral clustering [28], and modularity [26], [24], the latter discussed in detail below. 
Finally, model-based methods rely on fitting a probabilistic model for a network with communities. 
Perhaps the best known such model is the stochastic block model, which we will also refer to as 
simply the block model [18], [36], [?]. Other models include a recently introduced degree-corrected 
stochastic block model [20], mixture models for directed networks [27], multivariate latent variable 
models [16], latent feature models [17], and mixed membership stochastic block models for modeling 
overlapping communities [3]. From the algorithmic point of view, many model-based methods also 
lead to criteria to be optimized over all partitions, such as the profile likelihood under the assumed 
model. 

The large number of available methods leads to the question of how to compare them in a principled 
manner, other than on individual examples. There has been little theoretical analysis of community 
detection methods until very recently, when a consistency framework for community detection was 
introduced by Bickel and Chen [5]. They developed general theory for checking the consistency of 
detection criteria under the stochastic block model (discussed in detail below) as the number of 
nodes grows and the number of communities remains fixed, and their result has been generalized 
to allow the number of communities to grow in [?]; see also [?]. The stochastic block model, 
however, has serious limitations in practice: it treats all nodes within a community as stochastically 
equivalent, and thus does not allow for the existence of "hubs", high-degree nodes at the center of 
many communities observed in real data. To address this issue, Karrer and Newman [20] proposed 
the degree-corrected stochastic block model, which can accommodate hubs (a similar model for 
directed networks was previously proposed in [37], but the authors did not focus on community 
detection and assumed known community membership). In [20], the authors gave several examples showing 
this model fits data with hubs much better than the block model; however, there are no consistency 
results available under this new model, and thus no way to compare methods in general. 

In this paper, we generalize the consistency framework of [5] to the degree-corrected stochastic 
block model, and obtain a general theorem for community detection consistency. Since the degree- 
corrected model includes the regular block model as a special case, consistency results under the 
block model follow automatically. We then evaluate two types of modularity and the two criteria 
derived from the block model and the degree-corrected block model using this general framework. 
One of our goals is to emphasize the difference between assumed models (needed for theoretical 
analysis) and criteria for finding the optimal partition, which may or may not be motivated by a 
particular model. What we ultimately show agrees with statistical common sense: criteria derived 
from a particular model are consistent when this model is assumed, but not necessarily consistent 
if the model does not hold. Further, if a criterion relies implicitly on an assumption about the 
model parameters (e.g., modularity implicitly assumes that links within communities are stronger 
than between), then it will be consistent only if the model parameters are constrained to satisfy 
this assumption. We make all of the above statements precise later in the paper. 

The rest of the article is organized as follows. We set up all notation and define the relevant models 
and criteria in Section 2. Consistency results under the regular and the degree-corrected stochastic 
block models for all of the criteria in Section 2 are stated in Section 3. The general consistency 
theorem which implies all of these results is presented in Section 4. In Section 5, we compare the 
performance of these criteria on simulated networks, and in Section 6 we illustrate the methods 
on a network of political blogs. Section 7 concludes with a summary and discussion. All proofs are 
given in the Appendix. 



2 Network models and community detection criteria 

Before we proceed to discuss specific criteria and models, we introduce some basic notation. A 
network $N = (V, E)$, where $V$ is the set of nodes (vertices), $|V| = n$, and $E$ is the set of edges, can 
be represented by its $n \times n$ adjacency matrix $A = [A_{ij}]$, where $A_{ij} = 1$ if there is an edge from $i$ to 
$j$, and $A_{ij} = 0$ otherwise. We only consider unweighted and undirected networks here, and thus $A$ 
is a binary symmetric matrix. The community detection problem can be formulated as finding a 
disjoint partition $V = V_1 \cup \cdots \cup V_K$, or equivalently a set of node labels $e = \{e_1, \dots, e_n\}$, where $e_i$ 
is the label of node $i$ and takes values in $\{1, 2, \dots, K\}$. 

For any set of label assignments $e$, let $O(e)$ be the $K \times K$ matrix defined by 

$$O_{kl}(e) = \sum_{ij} A_{ij} I\{e_i = k, e_j = l\} ,$$

where $I$ is the indicator function. Further, let 

$$O_k(e) = \sum_l O_{kl}(e) , \qquad L = \sum_{ij} A_{ij} .$$

For $k \neq l$, $O_{kl}$ is the total number of edges between communities $k$ and $l$; $O_k$ is the sum of node 
degrees in community $k$, and $L$ is the sum of all degrees in the network. If self-loops are not 
allowed (i.e., $A_{ii} = 0$ is enforced), then we can also interpret $O_{kk}$ as twice the total number of 
edges within community $k$, and $L$ as twice the number of edges in the whole network. Finally, let 
$n_k(e) = \sum_i I\{e_i = k\}$ be the number of nodes in the $k$-th community, and $f(e) = (n_1/n, \dots, n_K/n)^T$. 
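
As a small illustration of these definitions, the following Python sketch computes the block counts $O(e)$, the community degree sums $O_k$, the total $L$, and the proportions $f(e)$ for a toy graph. The function name `block_counts`, the 0-based labels (the text uses $1, \dots, K$), and the toy edge list are assumptions made for this example only.

```python
def block_counts(A, e, K):
    """Return O (K x K), the row sums O_k, L, and f(e) = (n_1/n, ..., n_K/n).

    A is a symmetric 0/1 adjacency matrix (list of lists); e holds labels 0..K-1.
    """
    n = len(A)
    O = [[0] * K for _ in range(K)]
    for i in range(n):
        for j in range(n):
            O[e[i]][e[j]] += A[i][j]    # O_kl = sum_ij A_ij 1{e_i = k, e_j = l}
    O_k = [sum(row) for row in O]        # sum of node degrees in community k
    L = sum(O_k)                         # sum of all degrees
    n_k = [e.count(k) for k in range(K)]
    f = [m / n for m in n_k]
    return O, O_k, L, f

# Toy graph: a triangle {0, 1, 2} joined to the pair {3, 4} by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
A = [[0] * 5 for _ in range(5)]
for i, j in edges:
    A[i][j] = A[j][i] = 1
O, O_k, L, f = block_counts(A, [0, 0, 0, 1, 1], K=2)
```

With no self-loops, the diagonal entries of `O` count each within-community edge twice, in line with the interpretation given above.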

The stochastic block model, which is perhaps the most commonly used model for networks with 
communities, postulates that, given node labels $c = \{c_1, \dots, c_n\}$, the edge variables $A_{ij}$ are 
independent Bernoulli random variables with 

$$E[A_{ij}] = P_{c_i c_j} , \tag{1}$$

where $P = [P_{ab}]$ is a $K \times K$ symmetric matrix. We will use this formulation throughout the paper, 
which allows for self-loops. While it is also common to exclude self-loops, sometimes they are 
present in the data (as in our example in Section 6), and allowing them leads to simpler notation. 
In principle, all of our results go through for the version of the models with self-loops excluded, 
with appropriate modifications made to the proofs. 

Under the model (1), all nodes with the same label are stochastically equivalent to each other, 
which in practice limits the applicability of the stochastic block model, as pointed out in [20]. The 
alternative, the degree-corrected stochastic block model proposed in [20], is to replace (1) with 

$$E[A_{ij}] = \theta_i \theta_j P_{c_i c_j} , \tag{2}$$

where $\theta_i$ is a "degree parameter" associated with node $i$, reflecting its individual propensity to 
form ties. The degree parameters have to satisfy a constraint to be identifiable, which in [20] was 
set to $\sum_i \theta_i I(c_i = k) = 1$ for each $k$ (other constraints are possible). Further, the authors of [20] 
replaced the Bernoulli likelihood by the Poisson, to simplify technical derivations. With these assumptions, a 
profile likelihood can be derived by maximizing over $\theta$ and $P$, giving the following criterion to be 
optimized over all possible partitions: 

$$Q_{DCBM}(e) = \sum_{kl} O_{kl} \log \frac{O_{kl}}{O_k O_l} . \tag{3}$$

We have compared the performance of this criterion in practice to its slightly more complicated 
version based on the (correct) Bernoulli likelihood instead of the Poisson, and found no difference 
in the solutions these two methods produce. The Bernoulli distribution with a small mean is well 
approximated by the Poisson distribution, and most real networks are sparse, so one can expect 
the approximation to work well; see also a more detailed discussion of this in [?]. We will use (3) 
in all further analysis, to be consistent with [20] and take advantage of the simpler form. 

The degree-corrected model includes the regular stochastic block model as a special case, with 
all $\theta_i$'s equal. Enforcing this additional constraint on the profile likelihood leads to the following 
criterion to be optimized over all partitions: 

$$Q_{BM}(e) = \sum_{kl} O_{kl} \log \frac{O_{kl}}{n_k n_l} . \tag{4}$$

Like criterion (3), this is based on the Poisson assumption but gives identical results to the Bernoulli 
version in practice. Here we use the form (4) for consistency with (3) and with [20]. 

A different type of criterion used for community detection is modularity, introduced in [26]; see 
also [24] and [23]. The basic idea of modularity is to compare the number of observed edges within 
a community to the number of expected edges under a null model, and maximize this difference 
over all possible community partitions. Thus the general form of a modularity criterion is 

$$Q(e) = \sum_{ij} [A_{ij} - P_{ij}] I(e_i = e_j) , \tag{5}$$

where $P_{ij}$ is the (estimated) probability of an edge falling between $i$ and $j$ under the null model. 
The convention in the physics literature is to divide $Q$ by $L$, which we omit here, since it does not 
change the solution. 

The choice of the null model, that is, of a model with no communities ($K = 1$), determines the exact 
form of modularity. The stochastic block model with $K = 1$ is simply the Erdos-Renyi random 
graph, where $P_{ij}$ is a constant which can be estimated by $L/n^2$. Plugging $P_{ij} = L/n^2$ into (5) gives 
what we will call the Erdos-Renyi modularity (ERM), 

$$Q_{ERM}(e) = \sum_k \left( O_{kk} - \frac{n_k^2}{n^2} L \right) . \tag{6}$$

If instead we take the degree-corrected model with $K = 1$ as the null model, it postulates that 
$P_{ij} \propto \theta_i \theta_j$, where $\theta_i$ is the degree parameter. This is essentially the well-known expected degree 
random graph, also known as the configuration model. In this case, $P_{ij}$ can be estimated by 
$d_i d_j / L$, where $d_i = \sum_j A_{ij}$ is the degree of node $i$. Substituting this into (5) gives the popular 
Newman-Girvan modularity (NGM), introduced in [26]: 

$$Q_{NGM}(e) = \sum_k \left( O_{kk} - \frac{O_k^2}{L} \right) . \tag{7}$$
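
The two substitutions can be checked by grouping the double sum in the general modularity criterion by community. The short derivation below is a worked sketch that uses only the definitions of $O_{kk}$, $O_k$, $n_k$ and $d_i$ given earlier.

```latex
\begin{align*}
\sum_{ij} \Big[ A_{ij} - \frac{L}{n^2} \Big] I(e_i = e_j)
  &= \sum_k \sum_{ij} A_{ij} I\{e_i = k, e_j = k\}
     - \frac{L}{n^2} \sum_k \Big( \sum_i I\{e_i = k\} \Big)^2
   = \sum_k \Big( O_{kk} - \frac{n_k^2}{n^2} L \Big) , \\
\sum_{ij} \Big[ A_{ij} - \frac{d_i d_j}{L} \Big] I(e_i = e_j)
  &= \sum_k O_{kk}
     - \frac{1}{L} \sum_k \Big( \sum_i d_i I\{e_i = k\} \Big)^2
   = \sum_k \Big( O_{kk} - \frac{O_k^2}{L} \Big) .
\end{align*}
```

The last step of the second line uses $O_k = \sum_l O_{kl} = \sum_i d_i I\{e_i = k\}$, that is, $O_k$ is the sum of degrees in community $k$.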



The four different criteria for community detection are summarized in Table 1. Note that the 
two likelihood-based criteria, BM and DCBM, take into account all links within and between 
communities, and which communities they connect; whereas the modularities would not change if 
all the links connecting different communities were randomly permuted (as long as they did not 
become links within communities). Further, note that the degree correction amounts to substituting 
$O_k$ for $n_k$ and $L$ for $n$, both for modularity and likelihood-based criteria. Thus, if all nodes 
within a community are treated as equivalent, their number suffices to weigh community strength 
appropriately; and if the nodes are allowed to have different expected degrees, then the number of 
edges becomes the correct weight. Both of these features make sense intuitively, and, as we will see 
later, will fit in naturally with consistency conditions. 



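Table 1: Summary of community detection criteria 

             Block model                                            Degree-corrected block model 
Modularity   $\sum_k ( O_{kk} - n_k^2 L / n^2 )$ (ERM)              $\sum_k ( O_{kk} - O_k^2 / L )$ (NGM) 
Likelihood   $\sum_{kl} O_{kl} \log \frac{O_{kl}}{n_k n_l}$ (BM)    $\sum_{kl} O_{kl} \log \frac{O_{kl}}{O_k O_l}$ (DCBM) 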
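
The four criteria depend on the data only through the block counts $O_{kl}$ and the community sizes $n_k$, so they are cheap to evaluate once those are known. The sketch below is one possible way to code them; the function name `criteria`, the dictionary output, and the convention $0 \log 0 = 0$ for empty blocks are assumptions of this example, and all communities are assumed non-empty.

```python
import math

def criteria(O, n_k):
    """Evaluate ERM, NGM, BM and DCBM from block counts O and sizes n_k."""
    K = len(O)
    n = sum(n_k)
    O_k = [sum(row) for row in O]
    L = sum(O_k)

    def term(o, d):
        # Convention: 0 * log(0) = 0, so empty blocks contribute nothing.
        return 0.0 if o == 0 else o * math.log(o / d)

    erm = sum(O[k][k] - n_k[k] ** 2 * L / n ** 2 for k in range(K))
    ngm = sum(O[k][k] - O_k[k] ** 2 / L for k in range(K))
    bm = sum(term(O[k][l], n_k[k] * n_k[l])
             for k in range(K) for l in range(K))
    dcbm = sum(term(O[k][l], O_k[k] * O_k[l])
               for k in range(K) for l in range(K))
    return {"ERM": erm, "NGM": ngm, "BM": bm, "DCBM": dcbm}

# Block counts of a 5-node toy graph split into communities of sizes 3 and 2.
res = criteria([[6, 1], [1, 2]], [3, 2])
# The trivial one-community partition of the same graph.
res_trivial = criteria([[10]], [5])
```

Both modularities evaluate to zero on the trivial partition, which reflects their construction as a difference from a null model with no communities.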



Our analysis indicates that Newman-Girvan modularity and degree-corrected block model criteria 
are consistent under the more general degree-corrected models but Erdos-Renyi modularity and 
block model criteria are not, even though they are consistent under the regular block model. 
Further, we show that likelihood-based methods are consistent under their assumed model with 
no restrictions on parameters, whereas modularities are only consistent if the model parameters 
are constrained to satisfy a "stronger links within than between" condition, which is the basis of 
modularity derivations. In short, we show that a criterion is consistent when the underlying model 
and assumptions are correct, and not necessarily otherwise. 



3 Consistency of community detection criteria 

Here we present all the consistency results for the four different criteria defined in Section 2. All 
these results follow from the general consistency theorem in Section 4; the proofs are given in the 
Appendix. The notion of consistency of community detection as the number of nodes grows was 
introduced in [5]. They defined a community detection criterion $Q$ to be consistent if the node 
labels obtained by maximizing the criterion, $\hat{c} = \arg\max_e Q(e)$, satisfy 

$$P[\hat{c} = c] \to 1 \quad \text{as } n \to \infty . \tag{8}$$

Strictly speaking, this definition suffers from an identifiability problem, since most reasonable 
criteria, including all the ones discussed above, are invariant under a permutation of community 
labels $\{1, \dots, K\}$. Thus a better way to define consistency is to replace the equality $\hat{c} = c$ with the 
requirement that $\hat{c}$ and $c$ belong to the same equivalence class of label permutations. For simplicity 
of notation, we still write $\hat{c} = c$ in all consistency results in the rest of the paper, but take them to 
mean that $\hat{c}$ and $c$ are equal up to a permutation of labels. 



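The notion of consistency in (8) is very strong, since it requires asymptotically no errors. One can 
also define what we will call weak consistency, 

$$\forall \varepsilon > 0, \quad P\left[ \frac{1}{n} \sum_{i=1}^n 1(\hat{c}_i \neq c_i) < \varepsilon \right] \to 1 \quad \text{as } n \to \infty , \tag{9}$$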



where equality is also interpreted to mean membership in the same equivalence class with respect to 
label permutations. In [5], conditions were established for a criterion to be weakly consistent under 
the stochastic block model. All other assumptions being equal, weak consistency only requires that 
the expected degree of the graph $\lambda_n \to \infty$, whereas strong consistency requires $\lambda_n / \log n \to \infty$. 
Here, we will analyze both strong and weak consistency under the degree-corrected stochastic block 
model. 
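
Both notions of consistency compare estimated and true labels only up to a relabeling of communities. A minimal sketch of that comparison follows; the function name `error_rate` and the brute-force search over all $K!$ permutations (practical only for small $K$) are choices made for this illustration.

```python
from itertools import permutations

def error_rate(c_hat, c, K):
    """Fraction of mislabeled nodes, minimized over relabelings of 0..K-1."""
    n = len(c)
    best = n
    for perm in permutations(range(K)):
        # perm maps each estimated label to a candidate true label.
        mismatches = sum(perm[a] != b for a, b in zip(c_hat, c))
        best = min(best, mismatches)
    return best / n

rate_swapped = error_rate([1, 1, 0, 0], [0, 0, 1, 1], K=2)
rate_one_off = error_rate([0, 0, 0, 1], [0, 0, 1, 1], K=2)
```

In these terms, strong consistency asks that the error rate be exactly zero with probability tending to one, while weak consistency asks only that it fall below any fixed $\varepsilon > 0$.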



For the asymptotic analysis, we use a slightly different formulation of the degree-corrected model 
than that given by [20]. The main difference is that we treat true community labels $c$ and degree 
parameters $\theta = (\theta_1, \dots, \theta_n)$ as latent random variables rather than fixed parameters. Note, however, 
that the criteria we analyze were obtained as profile likelihoods with parameters treated as 
constants. This is one of the standard approaches to random effects models, known as conditional 
likelihood (see p. 234 of [21]). The network model we use for consistency analysis can be described 
as follows: 



1. Each node is independently assigned a pair of latent variables $(c_i, \theta_i)$, where $c_i$ is the community 
label taking values in $1, \dots, K$, and $\theta_i$ is a discrete "degree variable" taking values in 
$x_1 < \cdots < x_M$. We do not assume that $c_i$ is independent of $\theta_i$. 

2. The marginal distribution of $c$ is multinomial with parameter $\pi = (\pi_1, \dots, \pi_K)^T$, and $\theta$ 
satisfies $E[\theta_i] = 1$ for identifiability. 

3. Given $c$ and $\theta$, the edges $A_{ij}$ are independent Bernoulli random variables with 

$$E[A_{ij} \mid c, \theta] = \theta_i \theta_j P_{c_i c_j} ,$$

where $P = [P_{ab}]$ is a $K \times K$ symmetric matrix. 



For simplicity, we allow self-loops in the network, i.e., $E[A_{ii} \mid c, \theta] = \theta_i^2 P_{c_i c_i}$. Otherwise diagonal 
terms of $A$ have to be treated separately, which ultimately makes no difference for the analysis but 
makes notation more awkward. 
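
The three generative steps, including self-loops, can be sketched as a short sampler. Everything concrete here is an assumption for illustration: the function name `sample_dcsbm`, the 0-based labels, and the example values of the joint distribution, the degree values and the matrix $P$, which were chosen so that $E[\theta_i] = 1$ and all edge probabilities stay below 1.

```python
import random

def sample_dcsbm(n, Pi, x, P, rng):
    """Draw (c, theta, A) from the latent-variable model described above.

    Pi[a][u] = P(c_i = a, theta_i = x[u]); the caller must ensure that
    E[theta_i] = 1 and max(x)**2 * max(P) <= 1. Self-loops are allowed.
    """
    K, M = len(Pi), len(x)
    cells = [(a, u) for a in range(K) for u in range(M)]
    weights = [Pi[a][u] for a, u in cells]
    draws = rng.choices(cells, weights=weights, k=n)   # step 1: joint draw
    c = [a for a, _ in draws]
    theta = [x[u] for _, u in draws]
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):                          # j == i is the self-loop
            p = theta[i] * theta[j] * P[c[i]][c[j]]    # step 3: edge probability
            A[i][j] = A[j][i] = int(rng.random() < p)
    return c, theta, A

# Illustrative parameters: theta independent of c, two degree values.
Pi = [[0.25, 0.25], [0.25, 0.25]]
x = [0.4, 1.6]
P = [[0.10, 0.05], [0.05, 0.10]]
c, theta, A = sample_dcsbm(200, Pi, x, P, random.Random(0))
```

Drawing the pair $(c_i, \theta_i)$ jointly from a single table is what allows labels and degree variables to be dependent, as the model permits.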

To ensure that all probabilities are always less than 1, we require the model to satisfy the constraint 
$x_M^2 \max_{a,b} P_{ab} \le 1$. We also need to consider how the model changes with $n$. If $P_{ab}$ remains 
fixed as $n$ grows, the expected degree $\lambda_n$ will be proportional to $n$, which makes the network 
unrealistically dense. Instead, we allow the matrix $P$ to scale with $n$ and, in a slight abuse of 
notation, reparameterize it as $P_n = \rho_n P$, where $\rho_n = P(A_{ij} = 1) \to 0$ and $P$ is fixed. We then 
specify the rate of the expected degree $\lambda_n = n \rho_n$, which has to satisfy $\lambda_n / \log n \to \infty$ for strong 
consistency and $\lambda_n \to \infty$ for weak consistency. 

Let $\Pi$ be the $K \times M$ matrix representing the joint distribution of $(c_i, \theta_i)$ with $P(c_i = a, \theta_i = x_u) = \Pi_{au}$. 
Further, define $\tilde{\pi}_a = \sum_u x_u \Pi_{au}$. Note that $\sum_a \tilde{\pi}_a = 1$ since $E[\theta_i] = 1$. Moreover, we have 
$\tilde{\pi}_a = \pi_a$ if $c$ and $\theta$ are independent, or if $\theta_i \equiv 1$ (block models). Thus we can view $\tilde{\pi}$ as an adjusted 
version of $\pi$. 

Next, we state our consistency results for the two types of modularities under both the degree- 
corrected and the standard block model. 

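Theorem 3.1. Under the degree-corrected stochastic block model, if the parameters satisfy 

$$E_{aa} > 0 , \quad E_{ab} < 0 \quad \text{for all } a \neq b ,$$

where $\tilde{P}_0 = \sum_{ab} \tilde{\pi}_a \tilde{\pi}_b P_{ab}$, $W_{ab} = \tilde{\pi}_a \tilde{\pi}_b P_{ab} / \tilde{P}_0$, and $E = W - (W \mathbf{1})(W \mathbf{1})^T$, the Newman-Girvan modularity 
is strongly consistent when $\lambda_n / \log n \to \infty$, and weakly consistent when $\lambda_n \to \infty$. 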

The parameter constraints in Theorem 3.1 require, essentially, that the links within communities 
are more likely than the links between. This is particularly easy to see when $K = 2$, in which case 
the constraint simplifies to 

$$P_{11} P_{22} > P_{12}^2 .$$
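
The constraint of Theorem 3.1 is easy to check numerically for a given parameter setting. The sketch below assumes the reading $W_{ab} = \tilde{\pi}_a \tilde{\pi}_b P_{ab} / \tilde{P}_0$ with $\tilde{P}_0 = \sum_{ab} \tilde{\pi}_a \tilde{\pi}_b P_{ab}$ and $E = W - (W\mathbf{1})(W\mathbf{1})^T$; the function name `ngm_constraint_holds` and the two example matrices are hypothetical.

```python
def ngm_constraint_holds(pi_tilde, P):
    """Check E_aa > 0 and E_ab < 0 (a != b) for E = W - (W1)(W1)^T."""
    K = len(P)
    P0 = sum(pi_tilde[a] * pi_tilde[b] * P[a][b]
             for a in range(K) for b in range(K))
    W = [[pi_tilde[a] * pi_tilde[b] * P[a][b] / P0 for b in range(K)]
         for a in range(K)]
    r = [sum(row) for row in W]                       # the vector W1
    E = [[W[a][b] - r[a] * r[b] for b in range(K)] for a in range(K)]
    ok = all((E[a][b] > 0) if a == b else (E[a][b] < 0)
             for a in range(K) for b in range(K))
    return ok, E, W

# Stronger links within communities: the constraint holds.
ok_assort, E1, W1 = ngm_constraint_holds([0.5, 0.5], [[0.10, 0.05], [0.05, 0.10]])
# Stronger links between communities: the constraint fails.
ok_disassort, E2, W2 = ngm_constraint_holds([0.5, 0.5], [[0.05, 0.10], [0.10, 0.05]])
```

For $K = 2$ the entries of $W$ sum to one, which gives $E_{12} = W_{12}^2 - W_{11} W_{22}$ and $E_{11} = E_{22} = W_{11} W_{22} - W_{12}^2$; this is why the whole constraint collapses to $P_{11} P_{22} > P_{12}^2$.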

Taking $\theta_i \equiv 1$, we immediately obtain 

Corollary 3.1 (established in [5]). Under the standard stochastic block model with parameters 
satisfying the constraints of Theorem 3.1 with $\tilde{\pi}$ replaced by $\pi$, Newman-Girvan modularity is strongly 
consistent when $\lambda_n / \log n \to \infty$, and weakly consistent when $\lambda_n \to \infty$. 

For Erdos-Renyi modularity, which has not been studied theoretically before, we can also show 
consistency under the standard block model, albeit with a slightly stronger condition on links 
within communities being more likely than the links between: 

Theorem 3.2. Under the standard stochastic block model, if the parameters satisfy 

$$P_{aa} > P_0 , \quad P_{ab} < P_0 \quad \text{for all } a \neq b ,$$

where $P_0 = \sum_{ab} \pi_a \pi_b P_{ab}$, the Erdos-Renyi modularity criterion (6) is strongly consistent when 
$\lambda_n / \log n \to \infty$, and weakly consistent when $\lambda_n \to \infty$. 

However, the Erdos-Renyi modularity is not consistent under the degree-corrected model, at least 
not under the same parameter constraint. The Erdos-Renyi modularity prefers to group nodes with 
similar degrees together, which may not agree with true communities when the variance in node 
degrees is large. Here is a counter-example demonstrating this. Let $K = 2$, $\pi = (1/2, 1/2)^T$, $\rho_n = 1$ 
(so that the graph becomes dense as $n \to \infty$), and 



Further, $\theta$ is independent of $c$ and takes only two values, 1.6 and 0.4, with probability 1/2 each. If we 
assign all nodes their true labels, the population version of the criterion (where all random quantities 
are replaced by their expectations under the true model) gives $Q_{ERM} = 0.0125$. However, by 
grouping nodes with the same value of $\theta_i$ together, we get the population version of $Q_{ERM} = 0.0135$, 
higher than the value for the true partition, and this solution will therefore be preferred in the limit. 
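
The two population values can be reproduced numerically. The matrix $P$ of the counter-example is not shown above, so the sketch below assumes $P$ with 0.10 on the diagonal and 0.05 off the diagonal, which is one choice that yields the two stated values; the function name `population_erm` and the array layout are likewise assumptions of this illustration.

```python
from itertools import product

def population_erm(S, x, P):
    """Population Erdos-Renyi modularity: sum_k (H_kk - h_k^2 * sum_kl H_kl).

    S[k][a][u] is the proportion of nodes with assigned label k, true label a
    and degree value x[u]; quantities are normalized by n^2 (rho_n = 1).
    """
    K, M = len(P), len(x)
    idx = list(product(range(K), range(M)))

    def H(k, l):
        return sum(x[u] * x[v] * P[a][b] * S[k][a][u] * S[l][b][v]
                   for (a, u) in idx for (b, v) in idx)

    h = [sum(S[k][a][u] for (a, u) in idx) for k in range(K)]
    total = sum(H(k, l) for k in range(K) for l in range(K))
    return sum(H(k, k) - h[k] ** 2 * total for k in range(K))

x = [1.6, 0.4]
P = [[0.10, 0.05], [0.05, 0.10]]      # assumed; not given in the text above
# True labels: class k holds exactly the nodes with c_i = k.
S_true = [[[0.25, 0.25], [0.0, 0.0]], [[0.0, 0.0], [0.25, 0.25]]]
# Degree grouping: class 0 holds theta = 1.6, class 1 holds theta = 0.4.
S_degree = [[[0.25, 0.0], [0.25, 0.0]], [[0.0, 0.25], [0.0, 0.25]]]
q_true = population_erm(S_true, x, P)
q_degree = population_erm(S_degree, x, P)
```

Under this assumed $P$, the constraint of Theorem 3.2 holds ($0.10 > P_0 = 0.075 > 0.05$), yet the degree-based grouping scores higher than the true partition.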

Once again, the result makes sense intuitively, since the Erdos-Renyi modularity uses the regular 
block model as its null hypothesis, and the parameter constraint matches the "fewer links between 
than within" notion. From the algorithmic point of view, the main difference between Erdos-Renyi 
modularity and Newman-Girvan modularity is that the latter depends on the edge matrix O only 
and "weighs" communities by the number of edges, whereas the former weighs communities by the 
number of nodes (which, under the block model, is proportional to the number of edges, but 
under the degree-corrected model is not). 

Next we state the consistency results for the two criteria derived from profile likelihoods, DCBM 
(3) and BM (4). These require no parameter constraints. 

Theorem 3.3. Under the degree-corrected stochastic block model (and therefore under the regular 
model as well), the degree-corrected criterion (3) is strongly consistent when $\lambda_n / \log n \to \infty$, and 
weakly consistent when $\lambda_n \to \infty$. 

Theorem 3.4. Under the stochastic block model, the block model criterion (4) is strongly consistent 
when $\lambda_n / \log n \to \infty$, and weakly consistent when $\lambda_n \to \infty$. 

Theorem 3.4 was proved in [5] for a slightly different form of the profile likelihood (Bernoulli 
rather than the Poisson). Under the degree-corrected block model, criterion (4) is not necessarily 
consistent; the same counter-example can be used to demonstrate this. As was the case with 
modularities, the criterion consistent under the degree-corrected block model depends on $O$ only, 
whereas the criterion consistent only under the regular block model also depends on $n_k$. 

The theoretical results suggest that the likelihood-based criteria are always preferable over the 
modularity-based criteria, and that criteria based on the degree-corrected model are always preferred 
to the criteria based on the regular block model, since they are consistent under weaker 
conditions. In practice, however, this may not always hold. Computationally, modularity-type 
criteria can be approximately optimized by solving an eigenvalue problem [24], whereas likelihood-type 
criteria have no such approximations and thus have to be optimized by slower heuristic search 
algorithms, as was done in [5] and [?]. Moreover, fitting the degree-corrected block model requires 
estimating many more parameters than fitting a block model, and creates the usual trade-off between 
model complexity and goodness of fit. If the node degrees within communities do not vary 
widely, fitting a block model may provide a better solution; see more on this in Section 5. 



4 A general theorem on consistency under degree-corrected stochastic block models 



Here we prove a general theorem for checking consistency under degree-corrected stochastic block 
models for any criterion defined by a reasonably nice function. All consistency results for specific 
methods discussed in Section 3 are corollaries of this theorem. 

A large class of community detection criteria can be written as 

$$Q(e) = F\left( \frac{O(e)}{\mu_n} , f(e) \right) , \tag{10}$$

where $\mu_n = n^2 \rho_n$. For instance, many graph cut methods (mincut, ratio cut [39], normalized cut 
[34]) have this form and use functions that are designed to minimize the number of edges between 
communities. All criteria discussed in Section 3 can also be written in this form. Our goal here 
is to establish conditions for consistency of a criterion of this form under degree-corrected block 
models. 

A natural condition for consistency is that the "population version" of $Q(e)$ should be maximized 
by the correct community assignment, as in M-estimation. To define the population version of $Q$, 
we first define functions $H(S)$ and $h(S)$ corresponding to population versions of $O(e)$ and $f(e)$, 
respectively (the precise meaning of "population version" is clarified in Proposition 4.1 below). For 
any generic array $S = [S_{kau}] \in \mathbb{R}^{K \times K \times M}$, define a $K \times K$ matrix $H(S) = [H_{kl}(S)]$ by 

$$H_{kl}(S) = \sum_{abuv} x_u x_v P_{ab} S_{kau} S_{lbv} ,$$

and a $K$-dimensional vector $h(S) = [h_k(S)]$ by 

$$h_k(S) = \sum_{au} S_{kau} .$$

Also define $R(e) \in \mathbb{R}^{K \times K \times M}$ by 

$$R_{kau}(e) = \frac{1}{n} \sum_{i=1}^n I(e_i = k, c_i = a, \theta_i = x_u) .$$

Then we have 

Proposition 4.1. 

$$\frac{1}{\mu_n} E[O_{kl}(e) \mid c, \theta] = H_{kl}(R(e)) , \tag{11}$$

$$f_k(e) = h_k(R(e)) . \tag{12}$$


Proposition 4.1 explains the precise meaning of "population version": we take the conditional 
expectations given $c$ and $\theta$, and write them as functions of a generic variable $S$ instead of $R(e)$. 
The population version of $Q$ is defined as $F(H(S), h(S))$. 
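
The identities behind the "population version" can be verified numerically for fixed $c$, $\theta$ and an arbitrary labeling $e$: the conditional expectation of $O_{kl}(e)$, scaled by $\mu_n = n^2 \rho_n$, should equal $H_{kl}(R(e))$, and $n_k / n$ should equal $h_k(R(e))$. The sketch below is a check under illustrative parameter values; the function names and the random inputs are assumptions of this example.

```python
import random

def R_array(e, c, theta, x, K):
    """R_kau(e) = (1/n) * #{i : e_i = k, c_i = a, theta_i = x_u}."""
    n = len(e)
    R = [[[0.0] * len(x) for _ in range(K)] for _ in range(K)]
    for ei, ci, ti in zip(e, c, theta):
        R[ei][ci][x.index(ti)] += 1.0 / n
    return R

def H(S, x, P, k, l):
    """H_kl(S) = sum_{abuv} x_u x_v P_ab S_kau S_lbv."""
    K, M = len(P), len(x)
    return sum(x[u] * x[v] * P[a][b] * S[k][a][u] * S[l][b][v]
               for a in range(K) for u in range(M)
               for b in range(K) for v in range(M))

def scaled_expected_O(e, c, theta, P, k, l):
    """E[O_kl(e) | c, theta] / mu_n; the factor rho_n cancels."""
    n = len(e)
    total = sum(theta[i] * theta[j] * P[c[i]][c[j]]
                for i in range(n) for j in range(n)
                if e[i] == k and e[j] == l)
    return total / n ** 2

rng = random.Random(1)
K, x, P = 2, [0.4, 1.6], [[0.10, 0.05], [0.05, 0.10]]   # illustrative values
n = 40
c = [rng.randrange(K) for _ in range(n)]
theta = [rng.choice(x) for _ in range(n)]
e = [rng.randrange(K) for _ in range(n)]     # an arbitrary candidate labeling
R = R_array(e, c, theta, x, K)
gaps_H = [abs(scaled_expected_O(e, c, theta, P, k, l) - H(R, x, P, k, l))
          for k in range(K) for l in range(K)]
gaps_h = [abs(sum(sum(row) for row in R[k]) - e.count(k) / n)
          for k in range(K)]
```

Both sets of gaps vanish up to floating-point error, since the double sum over nodes is just a regrouping of the sum over the cells $(k, a, u)$ of $R(e)$.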

Now we can specify the key sufficient condition as follows: 

(*) $F(H(S), h(S))$ is uniquely maximized over $\mathcal{S} = \{S : S \ge 0, \sum_k S_{kau} = \Pi_{au}\}$ by $S = B$, with 
$B_{kau} = \Pi_{au} E_{ka}$ for any $a$ and $u$, where $E$ is any row permutation of a $K \times K$ identity matrix. 



The matrix $E$ deals with the permutation equivalence class. Since $R(c) \to B$ as $n \to \infty$, $S = B$ 
implies each class $k$ exactly matches a community in the population. For simplicity, in what follows 
we assume that $E$ is in fact the identity matrix itself. We will elaborate on this condition below. 
In addition, we need some regularity conditions, analogous to those in [5]: 

(a) $F$ is Lipschitz in its arguments; 

(b) Let $W = H(B)$. The directional derivatives $\frac{\partial^2 F}{\partial \varepsilon^2}(M_0 + \varepsilon(M_1 - M_0), t_0 + \varepsilon(t_1 - t_0))|_{\varepsilon = 0+}$ are 
continuous in $(M_1, t_1)$ for all $(M_0, t_0)$ in a neighborhood of $(W, \pi)$; 

(c) Let $G(S) = F(H(S), h(S))$. Then on $\mathcal{S}$, $\frac{\partial G((1 - \varepsilon) B + \varepsilon S)}{\partial \varepsilon}|_{\varepsilon = 0+} < -C < 0$ for all $\pi$, $P$. 

Now we are ready to state the main theorem. 

Theorem 4.1. For any $Q(e)$ of the form (10), if $\pi$, $P$, $F$ satisfy (*), (a)-(c), then $Q$ is strongly 
consistent under degree-corrected stochastic block models if $\lambda_n / \log n \to \infty$ and weakly consistent if 
$\lambda_n \to \infty$. 

The proof is given in the Appendix. This theorem is a generalization of Theorem 1 in [5] from the 
standard stochastic block models to degree-corrected models, and it implies all of the consistency 
results in Section 3. 

Finally, we return to the key condition (*). If $Q(e)$ is maximized by the true community labels $c$, 
then as $n \to \infty$, $F(H(S), h(S))$, the population version of $Q(e)$, should also be maximized by the 
true partition $S = B$, since $R(c) \to B$ and $Q(c) \to F(H(B), h(B))$, making (*) a natural condition. 
Further, since for any $e$, $\sum_k R_{kau}(e) \to \Pi_{au}$, the limit $S$ of $R(e)$ must satisfy $\sum_k S_{kau} = \Pi_{au}$. 
Therefore, we only need to consider maximizers of $F(H(S), h(S))$ satisfying this constraint. 



5 Numerical evaluation 



In this section, we compare the performance of the four community detection criteria from Section 
2 on simulated data, generated from the regular or the degree-corrected block model. The criteria 
are maximized over partitions using a greedy label-switching algorithm called tabu search [?], [14]. 



The key idea of tabu search is that once a node label has been switched, it will be "tabu" and 
not available for switching for a certain number of iterations, to prevent being trapped in a local 
maximum. Even though tabu search cannot guarantee convergence to the global maximum, it 
performs well in practice. Moreover, we run the search for a number of initial values and different 
orderings of nodes, to help avoid local maxima. 
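
The tabu idea described above can be sketched in a few lines. This is a simplified illustration, not the implementation used for the experiments: the function names, the tenure and iteration defaults, the best-single-switch move rule, and the full re-evaluation of the criterion at every candidate move are all assumptions chosen for brevity. Newman-Girvan modularity on a 5-node toy graph serves as the criterion.

```python
def ngm(A, e, K):
    """Newman-Girvan modularity sum_k (O_kk - O_k^2 / L) for labels 0..K-1."""
    n = len(A)
    O_kk, O_k = [0] * K, [0] * K
    for i in range(n):
        for j in range(n):
            O_k[e[i]] += A[i][j]
            if e[i] == e[j]:
                O_kk[e[i]] += A[i][j]
    L = sum(O_k)
    return sum(O_kk[k] - O_k[k] ** 2 / L for k in range(K))

def tabu_search(Q, e0, K, n_iter=50, tenure=2):
    """Maximize Q over label vectors by single-node switches with a tabu list."""
    e = list(e0)
    best_e, best_q = list(e), Q(e)
    tabu_until = [0] * len(e)          # node i is frozen while tabu_until[i] > t
    for t in range(1, n_iter + 1):
        move, move_q = None, float("-inf")
        for i in range(len(e)):
            if tabu_until[i] > t:
                continue
            old = e[i]
            for k in range(K):
                if k != old:
                    e[i] = k
                    q = Q(e)
                    if q > move_q:
                        move, move_q = (i, k), q
            e[i] = old
        if move is None:               # every node is currently tabu
            continue
        i, k = move
        e[i] = k                       # accept the best move even if Q decreases
        tabu_until[i] = t + tenure + 1 # freeze node i for `tenure` iterations
        if move_q > best_q:
            best_e, best_q = list(e), move_q
    return best_e, best_q

# Toy graph: a triangle {0, 1, 2} joined to the pair {3, 4} by the edge 2-3.
A = [[0] * 5 for _ in range(5)]
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]:
    A[i][j] = A[j][i] = 1
start = [0, 1, 0, 1, 0]
best_e, best_q = tabu_search(lambda e: ngm(A, e, 2), start, K=2)
```

Accepting the best available move even when it lowers the criterion is what lets the search leave a local maximum, while the stored `best_e` keeps the best partition visited. A practical version would update the block counts incrementally instead of recomputing the criterion from scratch.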



To compare the solution to the true labels, we use the adjusted Rand index [19], a measure of 
similarity between partitions commonly used in clustering. We have also computed the normalized 
mutual information, a measure more commonly used by physicists in the networks literature, which 
gives very similar results (not reported to save space). The adjusted Rand index is scaled so that 1 
corresponds to the perfect match and 0 to the expected difference between two random partitions, 
with higher values indicating better agreement. The figures in this section all present the median 
adjusted Rand index over 100 replications. 
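
For reference, the adjusted Rand index can be computed directly from the contingency table of two label vectors. The sketch below uses the standard pair-counting formula; the function name `adjusted_rand_index` and the handling of the degenerate case (returning 1.0 when the denominator is zero) are choices made for this illustration.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(u, v):
    """Adjusted Rand index between two label vectors of equal length."""
    n = len(u)
    sum_cells = sum(comb(m, 2) for m in Counter(zip(u, v)).values())
    sum_rows = sum(comb(m, 2) for m in Counter(u).values())
    sum_cols = sum(comb(m, 2) for m in Counter(v).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # value expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:                     # degenerate: no variation
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index depends only on which nodes are grouped together, so it is invariant to label permutations, and it can be negative when two partitions agree less than expected by chance.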

In all examples below, we generate networks with $n = 1000$ nodes and $K = 2$ communities. The 
node labels are generated independently with $P(c_i = 1) = \pi$, $P(c_i = 2) = 1 - \pi$. By varying $\pi$, we 
can investigate robustness of the methods to unbalanced community sizes. The probability matrix 
for the block model and the degree-corrected block model is set to 




where we vary $\rho$ to obtain different expected degrees $\lambda$. 



5.1 The degree-corrected stochastic block model 



For this simulation, we generate data from the degree-corrected model with two possible values for 
the degree parameter $\theta$. The degree parameters are generated independently from the labels, with 

$$P(\theta_i = mx) = P(\theta_i = x) = 1/2 ,$$

which implies $x = 2/(m + 1)$, since we need to have $E(\theta_i) = 1$. We vary the ratio $m$ from 1 (the regular 
block model) to 10, which allows us to study the effect of model misspecification on the regular 
block model. In this simulation, the community sizes are balanced ($\pi = 0.5$). 
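
The two-point degree distribution is fully determined by the ratio $m$: requiring $(mx + x)/2 = 1$ gives $x = 2/(m + 1)$. A minimal sampling sketch follows; the function name `sample_theta` and the seed are assumptions of this example.

```python
import random

def sample_theta(n, m, rng):
    """Draw n degree parameters equal to m*x or x with probability 1/2 each."""
    x = 2.0 / (m + 1)                  # solves (m*x + x) / 2 = 1
    return [m * x if rng.random() < 0.5 else x for _ in range(n)]

theta_m4 = sample_theta(1000, 4, random.Random(0))   # values 1.6 and 0.4
theta_m1 = sample_theta(10, 1, random.Random(0))     # regular block model
```

With $m = 4$ the two values are 1.6 and 0.4, and $m = 1$ makes every degree parameter equal to 1, which recovers the regular block model.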



(a) $\lambda = 125$ (b) $\lambda = 25$ (c) $\lambda = 12$ 

Figure 1: Results for the degree-corrected stochastic block model with two values for the degree 
parameters, $\pi = 0.5$, $m$ varies. 



Figure 1 shows the results for three different expected degrees $\lambda$. For the densest network with 
$\lambda = 125$ in Figure 1(a), the degree-corrected block model and Newman-Girvan modularity perform 
the best overall, as they assume the correct model and the methods are consistent. At $m = 1$, 
the regular block model is just as good, but its performance deteriorates rapidly as $m$ increases. 
The Erdos-Renyi modularity also performs perfectly for $m = 1$, and it takes larger values of $m$ 
for its performance to deteriorate than for block model likelihood, so we can conclude that the 
Erdos-Renyi modularity is more robust to variation in degrees. For both of them, poor results are 
due to grouping nodes with similar degrees together. The overall trend for sparser networks (Figure 
1(b) and (c)) is similar, but all methods perform worse, as with fewer links there is effectively less 
data to use for fitting the model, and the effect is more pronounced for large $m$, when degrees have 
higher variance. 

(a) $\lambda = 125$ (b) $\lambda = 25$ (c) $\lambda = 12$ 

Figure 2: Results for the standard stochastic block model, $m = 1$, $\pi$ varies. 



5.2 The stochastic block model 

Here we focus on the standard stochastic block model ($m = 1$) and vary $\pi$ to assess robustness
to unbalanced community sizes. All four criteria are consistent in this case, but in practice
the closer $\pi$ is to 0.5, the better they perform (Figure 2), with the exception of the block model
likelihood in the dense case ($\lambda = 125$), where it performs perfectly for all $\pi$. Overall, the block
model likelihood performs best, which is natural because it is the maximum likelihood estimator of
the correct model. The Erdős-Rényi modularity also performs better than the other two criteria,
which overfit the data by assuming the degree-corrected model and accounting for variation in
observed degrees, which in this case only adds noise.



5.3 Unbalanced community sizes 



Figure 3: Results for the degree-corrected stochastic block model with two values for the degree
parameters, $\pi = 0.3$, $m$ varies. Panels: (a) $\lambda = 125$; (b) $\lambda = 25$; (c) $\lambda = 12$.

In this simulation we consider the degree-corrected stochastic block model with unbalanced
community sizes. We fix $\pi = 0.3$ and vary the ratio $m$ in Figure 3. For a dense network ($\lambda = 125$,
Figure 3(a)), the performance with $\pi = 0.3$ is similar to the balanced case with $\pi = 0.5$ (Figure
1(a)). However, in sparser networks modularity performs much worse with unbalanced community
sizes. This can also be seen in Figure 2 for the case $m = 1$. The failure of modularity to deal with
unbalanced community sizes was also recently pointed out by [40]. Note also that in the sparsest
case ($\lambda = 12$, Figure 3(c)), the degree-corrected model suffers from over-fitting when $m = 1$, as was
also seen in Figure 2.



5.4 A different degree distribution 



In the last simulation, we test the sensitivity of all methods, but in particular the degree-corrected
model, to the assumption of a discrete degree distribution. Here we sample the degree parameters
$\theta_i$ independently from the following distribution:

\[ \theta_i = \begin{cases} \eta_i & \text{w.p. } \alpha, \\ 2/(m+1) & \text{w.p. } (1-\alpha)/2, \\ 2m/(m+1) & \text{w.p. } (1-\alpha)/2 , \end{cases} \]

where $\eta_i$ is uniformly distributed on the interval $[0, 2]$. The variance of $\theta_i$ is equal to
$\alpha/3 + (1-\alpha)(m-1)^2/(m+1)^2$. In this simulation, we fix $m = 10$, which makes the variance a decreasing
function of $\alpha$, and vary $\alpha$ from 0 to 1. We also fix $\pi = 0.5$.
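The variance formula can be checked numerically. The short sketch below (function names are illustrative, not from the paper) evaluates the closed form, recomputes it from the second moment using $E(\theta_i) = 1$ and $E(\eta_i^2) = 4/3$ for the uniform component, and tabulates the variance for $m = 10$ over a grid of mixing weights.

```python
def theta_variance(alpha, m):
    """Closed form: alpha/3 + (1 - alpha) * (m - 1)^2 / (m + 1)^2."""
    return alpha / 3.0 + (1.0 - alpha) * (m - 1) ** 2 / (m + 1) ** 2

def theta_variance_from_moments(alpha, m):
    """Same variance computed as E[theta^2] - 1, since every component has mean 1."""
    lo, hi = 2.0 / (m + 1), 2.0 * m / (m + 1)
    second_moment = alpha * (4.0 / 3.0) + (1.0 - alpha) * 0.5 * (lo ** 2 + hi ** 2)
    return second_moment - 1.0

# Variance for m = 10 as alpha moves from 0 to 1 in steps of 0.1.
variances = [theta_variance(a / 10.0, 10) for a in range(11)]
```

For $m = 10$ the two-point component has variance $81/121 > 1/3$, so putting more weight on the uniform component lowers the overall variance.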

Figure 4: Panels: (a) $\lambda = 125$; (b) $\lambda = 25$; (c) $\lambda = 12$.

The results in Figure 4 show that the degree-corrected block model likelihood and Newman-Girvan
modularity still perform well, which suggests that the discreteness of $\theta$ is not a crucial assumption.
The regular block model fails in this case, as we would expect from earlier results since $m = 10$,
but the performance of the Erdős-Rényi modularity improves as $\alpha$ increases, which agrees with our
earlier observation on its relative robustness to variation in degrees.



6 Example: the political blogs network 



In this section, we analyze a real network of political blogs compiled by [1]. The nodes of this
network are blogs about US politics and the edges are hyperlinks between these blogs. The data
were collected right after the 2004 presidential election and demonstrate strong divisions; each blog
was manually labeled as liberal or conservative, and we take these labels as ground truth. Following
the analysis in [20], we ignore directions of the hyperlinks and focus on the largest connected
component of this network, which contains 1222 nodes and 16714 edges, with an average degree of
approximately 27. Some summary statistics of the node degrees are given in Table 2, which shows
that the degree distribution is heavily skewed to the right.



Table 2: Statistics of node degrees in the political blogs network

Mean    Median   Min    1st Qt.   3rd Qt.   Max
27.36   13.00    1.00   3.00      36.00     351.00



We compare the partitions into two communities found by the four different community detection
criteria with the true labels using the adjusted Rand index. The Newman-Girvan modularity and
the degree-corrected model find very similar partitions (they differ over only four nodes, and have
the same adjusted Rand index value of 0.819, the highest of all methods). The partition found by
the Erdős-Rényi modularity has a slightly worse agreement with the truth (adjusted Rand index of
0.793). The block model likelihood divides the nodes into two groups of low degree and high degree,
with the adjusted Rand index of nearly 0, which is equivalent to random guessing. The results are
shown in Figure 5 (drawn using the igraph package in R [it] with the Fruchterman and Reingold
layout [13]). These are consistent with what we observed in simulation studies: the Newman-Girvan
modularity and the degree-corrected block model likelihood perform better in a network with high
degree variation, and the Erdős-Rényi modularity is more robust to degree variation than the block
model likelihood.
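For readers unfamiliar with the adjusted Rand index used in this comparison, a minimal sketch of the standard pair-counting formula is given below. The function name and the toy label vectors are illustrative assumptions; the values reported above were not produced by this code.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions given as label sequences."""
    n = len(labels_a)
    # Number of node pairs placed together by both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Number of node pairs placed together by each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)     # expected agreement under random labeling
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:                 # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

same_up_to_relabeling = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index does not depend on how communities are numbered, which is why a partition and its relabeled copy receive the maximal value.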

All criteria were maximized by tabu search, but for modularities we also computed the solutions
based on the eigendecomposition of the modularity matrix. Both solutions were worse than those
found by tabu search, but while for Newman-Girvan modularity the difference was slight (the
adjusted Rand index of 0.781 instead of 0.819), eigendecomposition of the Erdős-Rényi modularity
yielded a poor result similar to that of block model likelihood (with adjusted Rand index value of
0.092 instead of 0.793 by tabu search). This suggests that Erdős-Rényi modularity is numerically
less stable under high degree variation, in addition to being theoretically not consistent. More
analysis of the eigendecomposition-based solutions is needed for both types of modularities to
understand conditions under which these approximations work well.
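To make the eigendecomposition-based approximation concrete, the sketch below splits a network into two groups by the signs of the leading eigenvector of the Newman-Girvan modularity matrix $B_{ij} = A_{ij} - d_i d_j / (2L)$. It is only an illustration under assumptions: the text does not say how the eigendecomposition was computed, so a shifted power iteration is used here as a stand-in, and the function names and the toy graph (two 4-cliques joined by one edge) are hypothetical.

```python
def modularity_matrix(A):
    """Newman-Girvan modularity matrix B_ij = A_ij - d_i*d_j/(2L), with L the number of edges."""
    n = len(A)
    d = [sum(row) for row in A]
    two_L = float(sum(d))
    return [[A[i][j] - d[i] * d[j] / two_L for j in range(n)] for i in range(n)]

def leading_eigenvector(B, iters=2000):
    """Power iteration on B + shift*I; the shift makes all eigenvalues nonnegative."""
    n = len(B)
    shift = max(sum(abs(x) for x in row) for row in B)   # Gershgorin bound on |eigenvalues|
    v = [float(i + 1) for i in range(n)]                 # arbitrary non-constant start
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(n)) + shift * v[i] for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def spectral_bipartition(A):
    """Assign each node to a group by the sign of its leading-eigenvector entry."""
    v = leading_eigenvector(modularity_matrix(A))
    return [0 if x >= 0 else 1 for x in v]

# Toy graph: nodes 0-3 and 4-7 form cliques, joined by the single edge (3, 4).
n = 8
A = [[0] * n for _ in range(n)]
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                A[i][j] = 1
A[3][4] = A[4][3] = 1
groups = spectral_bipartition(A)
```

Such a split is only an approximation to the global maximizer of the modularity, which is the object studied in the theory.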



7 Summary and discussion 



In this paper, we developed a general tool for checking consistency of community detection criteria
under the degree-corrected stochastic block model, a more general and practical model than the
standard stochastic block model for which such theory was previously available [5]. This general
tool allowed us to obtain consistency results for four different community detection criteria, and,
to the best of our knowledge for the first time in the networks literature, to clearly separate the
effects of the model assumed for criteria derivation from the model assumed true for analysis of the
criteria. What we have shown is, essentially, statistical common sense: methods are consistent when
the model they assume holds for the data. The parameter constraints are needed when methods
implicitly rely on them, although we found that the two different modularity methods, while using
the same constraint in spirit, require somewhat different conditions on parameters to be consistent.
The theoretical analysis agrees well with both simulation studies and the data analysis, which also
indicate that the methods with better theoretical consistency properties do not always perform
best in practice: there is a cost associated with fitting the extra complexity of the degree-corrected
model, and if there is not enough data for that, or the data does not have much variation in node
degrees, simpler methods based on the standard stochastic block model will in fact do better.

Figure 5: Political blogs data. Node area is proportional to the logarithm of its degree and the
colors represent community labels. Surviving panel titles: (a) BM; (b) DCBM; (e) True labels.



There are many questions that require further investigation here, even in the context of
model-based community detection when a model is assumed true. For example, we assumed that $K$
is known, which is not unreasonable in some cases (e.g., dividing political blogs into liberal and
conservative), but is in general a difficult open problem in community detection. Standard methods
such as AIC and BIC do not seem to lend themselves easily to this case, because of parameters
disappearing in non-standard ways when going from $K + 1$ to $K$ blocks. A permutation test was
proposed in [41], but clearly more work is needed. There is also the question of what happens if $K$
is allowed to grow with $n$, which is probably more realistic than fixed $K$; for the stochastic block
model, this case has been considered by [3] and [sl], but their analysis is specific to the particular
methods they considered and does not extend easily to the degree-corrected block model. Another
open question is the properties of approximate but more easily computable solutions based on
the eigendecomposition, as opposed to the properties of global maximizers we studied here. For
the stochastic block model, part of this analysis was performed in [s^]. Our practical experience
suggests that the behavior of eigenvectors can be quite complicated, and it is not understood at this
point when this approximation works well. Finally, the sparse case $\lambda_n = O(1)$ is an open problem
in general, although results for some special cases of the stochastic block model have been recently
obtained.



Appendix 

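We start by summarizing notation. Let $R(e), V(e) \in \mathbb{R}^{K \times K \times M}$, $\hat\Pi \in \mathbb{R}^{K \times M}$, and $f(e), f^0(e) \in \mathbb{R}^K$,
where

\[ R_{kau}(e) = \frac{1}{n} \sum_{i=1}^n I(e_i = k, c_i = a, \theta_i = x_u) , \]
\[ V_{kau}(e) = \frac{\sum_{i=1}^n I(e_i = k, c_i = a, \theta_i = x_u)}{\sum_{i=1}^n I(c_i = a, \theta_i = x_u)} , \]
\[ \hat\Pi_{au} = \frac{1}{n} \sum_{i=1}^n I(c_i = a, \theta_i = x_u) , \]
\[ f_k(e) = \frac{1}{n} \sum_{i=1}^n I(e_i = k) = \sum_{au} V_{kau}(e) \hat\Pi_{au} , \]
\[ f^0_k(e) = \sum_{au} V_{kau}(e) \Pi_{au} . \]

Even though the arbitrary labeling $e$ is not random, intuitively one can think of $R$ as the empirical
joint distribution of $e$, $c$, and $\theta$, and of $V$ as the conditional distribution of $e$ given $c$ and $\theta$. Further, $\hat\Pi$
is the empirical joint distribution of $c$ and $\theta$, and thus an estimate of their true joint distribution
$\Pi$; $f$ is the empirical marginal "distribution" of $e$, and $f^0$ is the same marginal but with the
empirical joint distribution $\hat\Pi$ replaced by its population version $\Pi$. Then $\sum_k V_{kau}(e) = 1$, and
$V_{kau}(c) = I(k = a)$ for all $u$. Further, define $\hat T(e) \in \mathbb{R}^{K \times K}$ to be a rescaled expectation of the
matrix $O$ conditional on $c$ and $\theta$,

\[ \hat T_{kl}(e) = \frac{1}{\mu_n} E[O_{kl} \mid c, \theta] . \]

From Proposition 4.1,

\[ \hat T_{kl}(e) = \sum_{abuv} x_u x_v P_{ab} R_{kau}(e) R_{lbv}(e) = \sum_{abuv} x_u x_v P_{ab} V_{kau}(e) \hat\Pi_{au} V_{lbv}(e) \hat\Pi_{bv} . \]

Replacing $\hat\Pi$ by its expectation $\Pi$, we define $T(e) \in \mathbb{R}^{K \times K}$ by

\[ T_{kl}(e) = \sum_{abuv} x_u x_v P_{ab} V_{kau}(e) \Pi_{au} V_{lbv}(e) \Pi_{bv} . \]

Also define $X(e) \in \mathbb{R}^{K \times K}$ to be the rescaled difference between $O$ and its conditional expectation,

\[ X(e) = \frac{O(e)}{\mu_n} - \hat T(e) . \]

These quantities will be used in the proof of the general Theorem 4.1, where we first approximate
$\frac{1}{\mu_n} O_{kl}$ by $\hat T_{kl}(e)$ and then approximate $\hat T_{kl}(e)$ by $T_{kl}(e)$.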
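The arrays in this notation are simple frequency tables, which the following sketch computes from label vectors. It is an illustration only: the function name, the dictionary-based representation, and the toy vectors `e`, `c`, `theta` are assumptions made for the example.

```python
from collections import Counter

def empirical_arrays(e, c, theta):
    """Return R, V, Pi_hat, f as dictionaries keyed by (k, a, x), (a, x), or k."""
    n = len(e)
    joint = Counter(zip(e, c, theta))     # counts of (e_i, c_i, theta_i)
    cell = Counter(zip(c, theta))         # counts of (c_i, theta_i)
    R = {key: cnt / n for key, cnt in joint.items()}
    Pi_hat = {key: cnt / n for key, cnt in cell.items()}
    V = {(k, a, x): cnt / cell[(a, x)] for (k, a, x), cnt in joint.items()}
    f = {k: cnt / n for k, cnt in Counter(e).items()}
    return R, V, Pi_hat, f

# Toy example: two communities, two degree values, one arbitrary labeling e.
c = [0, 0, 0, 1, 1, 1]
theta = [0.5, 1.5, 0.5, 1.5, 0.5, 1.5]
e = [0, 1, 0, 1, 1, 0]
R, V, Pi_hat, f = empirical_arrays(e, c, theta)

# f_k recomputed as sum over (a, u) of V_kau * Pi_hat_au.
f_from_V = Counter()
for (k, a, x), v in V.items():
    f_from_V[k] += v * Pi_hat[(a, x)]
```

Entries that would be zero are simply absent from the dictionaries, so sums over $(a, u)$ only need to visit the stored keys.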

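Proof of Proposition 4.1. We only prove (11) since (12) is trivial.

\[ \frac{1}{\mu_n} E[O_{kl} \mid c, \theta] = \frac{1}{\mu_n} \sum_{ij} \sum_{abuv} E[A_{ij} I(e_i = k, c_i = a, \theta_i = x_u) I(e_j = l, c_j = b, \theta_j = x_v) \mid c, \theta] \]
\[ = \sum_{abuv} x_u x_v P_{ab} R_{kau}(e) R_{lbv}(e) = H_{kl}(R(e)) . \]

□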



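Before we proceed to the general theorem, we state a lemma based on Bernstein's inequality.

Lemma A.1. Let $\|X\|_\infty = \max_{kl} |X_{kl}|$ and $|e - c| = \sum_{i=1}^n I(e_i \neq c_i)$. Then

\[ P\big( \max_e \|X(e)\|_\infty \ge \epsilon \big) \le 2 K^{n+2} \exp\Big( -\frac{1}{8C} \epsilon^2 \mu_n \Big) , \tag{13} \]

for $\epsilon < 3C$, where $C = \max\{x_u x_v P_{ab}\}$.

\[ P\big( \max_{|e-c| \le m} \|X(e) - X(c)\|_\infty \ge \epsilon \big) \le 2 \binom{n}{m} K^{m+2} \exp\Big( -\frac{3}{8} \epsilon \mu_n \Big) , \tag{14} \]

for $\epsilon \ge 6Cm/n$.

\[ P\big( \max_{|e-c| \le m} \|X(e) - X(c)\|_\infty \ge \epsilon \big) \le 2 \binom{n}{m} K^{m+2} \exp\Big( -\frac{n}{16mC} \epsilon^2 \mu_n \Big) , \tag{15} \]

for $\epsilon < 6Cm/n$.

Proof. This lemma is similar to Lemma 1.1 of [5], but since the proof in [5] contains some relatively
minor errors, we give a full proof here for completeness. First note that in order to prove (13), it
is sufficient to show

\[ P(|X_{kl}(e)| \ge \epsilon \mid c, \theta) \le 2 \exp\Big( -\frac{1}{8C} \epsilon^2 \mu_n \Big) , \tag{16} \]

where

\[ X_{kl}(e) = \frac{1}{\mu_n} \big[ O_{kl}(e) - E(O_{kl}(e) \mid c, \theta) \big] , \]
\[ O_{kl}(e) = \sum_{i=1}^n A_{ii} I(e_i = k, e_i = l) + 2 \sum_{i<j} A_{ij} I(e_i = k, e_j = l) . \]

The proof relies on Bernstein's inequality (see e.g., [35]): if $Y_i$ are independent, $|Y_i| \le M$, $E Y_i = 0$,
and $S_n = \sum_i Y_i$, then

\[ P(|S_n| \ge w) \le 2 \exp\Big( -\frac{w^2/2}{\mathrm{Var}(S_n) + Mw/3} \Big) . \tag{17} \]

Note that conditioning on $c, \theta$, the $A_{ij}$ are independent and $|A_{ij}| \le 1$. Let $M = 2$, and then (17)
becomes

\[ P(\mu_n |X_{kl}(e)| \ge w \mid c, \theta) \le 2 \exp\Big( -\frac{w^2/2}{\mathrm{Var}(O_{kl} \mid c, \theta) + 2w/3} \Big) . \tag{18} \]

In order to compare the two terms in the denominator, we need to evaluate $\mathrm{Var}(O_{kl} \mid c, \theta)$:

\[ \mathrm{Var}(A_{ij} \mid c, \theta) = \rho_n \theta_i \theta_j P_{c_i c_j} - (\rho_n \theta_i \theta_j P_{c_i c_j})^2 \le \rho_n C , \]
\[ \mathrm{Var}(O_{kl} \mid c, \theta) \le (n + 4(n-1)n/2) \rho_n C \le 2 n^2 \rho_n C . \]

Let $w = \epsilon \mu_n = \epsilon n^2 \rho_n$. For $\epsilon < 3C$,

\[ P(|X_{kl}(e)| \ge \epsilon \mid c, \theta) \le 2 \exp\Big( -\frac{(\epsilon n^2 \rho_n)^2/2}{2 n^2 \rho_n C + 2 \epsilon n^2 \rho_n/3} \Big) \le 2 \exp\Big( -\frac{1}{8C} \epsilon^2 \mu_n \Big) . \]

We now prove (14) and (15). If $e_{m+1} = c_{m+1}, \dots, e_n = c_n$,

\[ O_{kl}(e) - O_{kl}(c) = \sum_{i=1}^m \big( A_{ii} I(e_i = k, e_i = l) - A_{ii} I(c_i = k, c_i = l) \big) \]
\[ + 2 \sum_{i<j}^m \big( A_{ij} I(e_i = k, e_j = l) - A_{ij} I(c_i = k, c_j = l) \big) \]
\[ + 2 \sum_{i=1}^m \sum_{j=m+1}^n \big( A_{ij} I(e_i = k, e_j = l) - A_{ij} I(c_i = k, c_j = l) \big) , \]

\[ \mathrm{Var}(O_{kl}(e) - O_{kl}(c) \mid c, \theta) \le [m + 4(m(m-1)/2 + m(n-m))] \rho_n C \le 4 m n \rho_n C . \]

We again apply (17). For $|e - c| \le m$ and $\epsilon \ge 6Cm/n$,

\[ P(|X_{kl}(e) - X_{kl}(c)| \ge \epsilon \mid c, \theta) \le 2 \exp\Big( -\frac{(\epsilon n^2 \rho_n)^2/2}{4 m n \rho_n C + 2 \epsilon n^2 \rho_n/3} \Big) \le 2 \exp\Big( -\frac{3}{8} \epsilon \mu_n \Big) . \]

For $\epsilon < 6Cm/n$,

\[ P(|X_{kl}(e) - X_{kl}(c)| \ge \epsilon \mid c, \theta) \le 2 \exp\Big( -\frac{(\epsilon n^2 \rho_n)^2/2}{4 m n \rho_n C + 2 \epsilon n^2 \rho_n/3} \Big) \le 2 \exp\Big( -\frac{n}{16mC} \epsilon^2 \mu_n \Big) . \]

□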
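The three simplified exponents in the lemma come from bounding the denominator of Bernstein's inequality in the stated ranges of $\epsilon$. The sketch below checks numerically that the exact Bernstein exponent is never smaller than the simplified one; the function names and the parameter values used in any call are hypothetical, with $\mu_n = n^2 \rho_n$ and the variance bounds taken from the proof.

```python
def bernstein_exponent(w, variance, M):
    """Exponent in Bernstein's inequality: (w^2 / 2) / (variance + M * w / 3)."""
    return (w * w / 2.0) / (variance + M * w / 3.0)

def exponent_13(eps, n, rho, C):
    """Exact and simplified exponents for (13); the simplification needs eps < 3C."""
    mu = n * n * rho
    exact = bernstein_exponent(eps * mu, 2.0 * mu * C, 2.0)   # Var <= 2 n^2 rho C
    simplified = eps * eps * mu / (8.0 * C)
    return exact, simplified

def exponent_14_15(eps, n, rho, C, m):
    """Exact and simplified exponents for (14) (eps >= 6Cm/n) or (15) (eps < 6Cm/n)."""
    mu = n * n * rho
    exact = bernstein_exponent(eps * mu, 4.0 * m * n * rho * C, 2.0)  # Var <= 4 m n rho C
    if eps >= 6.0 * C * m / n:
        simplified = 3.0 * eps * mu / 8.0
    else:
        simplified = n * eps * eps * mu / (16.0 * m * C)
    return exact, simplified

# Hypothetical values: n = 1000, rho_n = 0.05, C = 0.1, m = 10 mislabeled nodes.
example_13 = exponent_13(0.1, 1000, 0.05, 0.1)
example_14 = exponent_14_15(0.05, 1000, 0.05, 0.1, 10)
example_15 = exponent_14_15(0.004, 1000, 0.05, 0.1, 10)
```

A larger exponent gives a smaller probability bound, so the simplified exponents yield valid, slightly weaker, bounds.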



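Proof of Theorem 4.1. The proof is divided into three steps.

Step 1: show that $F(O(e)/\mu_n, f(e))$ is uniformly close to its population version. More precisely, we
need to prove that there exists $\epsilon_n \to 0$ such that

\[ P\Big( \max_e \Big| F\Big( \frac{O(e)}{\mu_n}, f(e) \Big) - F(T(e), f^0(e)) \Big| < \epsilon_n \Big) \to 1 , \quad \text{if } \lambda_n \to \infty . \tag{19} \]

Since

\[ \Big| F\Big( \frac{O(e)}{\mu_n}, f(e) \Big) - F(T(e), f^0(e)) \Big| \le \Big| F\Big( \frac{O(e)}{\mu_n}, f(e) \Big) - F(\hat T(e), f(e)) \Big| + \big| F(\hat T(e), f(e)) - F(T(e), f^0(e)) \big| , \]

it is sufficient to bound these two terms uniformly. By Lipschitz continuity,

\[ \Big| F\Big( \frac{O(e)}{\mu_n}, f(e) \Big) - F(\hat T(e), f(e)) \Big| \le M_1 \|X(e)\|_\infty . \tag{20} \]

By (13), (20) converges to 0 uniformly if $\lambda_n \to \infty$, and

\[ \big| F(\hat T(e), f(e)) - F(T(e), f^0(e)) \big| \le M_1 \|\hat T(e) - T(e)\|_\infty + M_2 \|f(e) - f^0(e)\| , \tag{21} \]

where $\| \cdot \|$ is the Euclidean norm for vectors. Further,

\[ |\hat T_{kl}(e) - T_{kl}(e)| = \Big| \sum_{abuv} x_u x_v P_{ab} V_{kau}(e) V_{lbv}(e) (\hat\Pi_{au} \hat\Pi_{bv} - \Pi_{au} \Pi_{bv}) \Big| \le \sum_{abuv} x_u x_v P_{ab} |\hat\Pi_{au} \hat\Pi_{bv} - \Pi_{au} \Pi_{bv}| , \tag{22} \]

and

\[ |f_k(e) - f^0_k(e)| = \Big| \sum_{au} V_{kau}(e) (\hat\Pi_{au} - \Pi_{au}) \Big| \le \sum_{au} |\hat\Pi_{au} - \Pi_{au}| . \tag{23} \]

Since $\hat\Pi \xrightarrow{P} \Pi$, (21) converges to 0 uniformly. Thus, (19) holds.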
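Step 2: Prove that there exists $\delta_n \to 0$, such that

\[ P\Big( \max_{\{e : \|V(e) - I\|_1 \ge \delta_n\}} F\Big( \frac{O(e)}{\mu_n}, f(e) \Big) < F\Big( \frac{O(c)}{\mu_n}, f(c) \Big) \Big) \to 1 , \tag{24} \]

where $\|W\|_1 = \sum_{kau} |W_{kau}|$ for $W \in \mathbb{R}^{K \times K \times M}$.

By continuity and (*), there exists $\delta_n \to 0$, such that

\[ F(T(c), f^0(c)) - F(T(e), f^0(e)) > 2 \epsilon_n \]

if $\|V(e) - I\|_1 \ge \delta_n$, where $I = V(c)$. Thus from (19),

\[ P\Big( \max_{\{e : \|V(e) - I\|_1 \ge \delta_n\}} F\Big( \frac{O(e)}{\mu_n}, f(e) \Big) < F\Big( \frac{O(c)}{\mu_n}, f(c) \Big) \Big) \]
\[ \ge P\Big( \Big| \max_{\{e : \|V(e) - I\|_1 \ge \delta_n\}} F\Big( \frac{O(e)}{\mu_n}, f(e) \Big) - \max_{\{e : \|V(e) - I\|_1 \ge \delta_n\}} F(T(e), f^0(e)) \Big| < \epsilon_n , \ \Big| F\Big( \frac{O(c)}{\mu_n}, f(c) \Big) - F(T(c), f^0(c)) \Big| < \epsilon_n \Big) \to 1 . \]

(24) implies

\[ P(\|V(\hat c) - I\|_1 < \delta_n) \to 1 . \]

Since

\[ \frac{1}{n} |e - c| = \frac{1}{n} \sum_{i=1}^n I(c_i \neq e_i) = \sum_{au} \hat\Pi_{au} (1 - V_{aau}(e)) \le \sum_{au} (1 - V_{aau}(e)) \]
\[ = \frac{1}{2} \Big( \sum_{au} (1 - V_{aau}(e)) + \sum_{au} \sum_{k \neq a} V_{kau}(e) \Big) = \frac{1}{2} \|V(e) - I\|_1 , \]

weak consistency follows.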
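The inequality linking the misclassification rate to $\|V(e) - I\|_1$ can be checked directly on label vectors. The sketch below is an illustration under assumptions: the function names and toy vectors are hypothetical, and labels are assumed to take values in $\{0, \dots, K-1\}$.

```python
from collections import Counter

def v_minus_identity_l1(e, c, theta, K):
    """Compute ||V(e) - I||_1, summing |V_kau(e) - I(k = a)| over observed cells (a, u)."""
    joint = Counter(zip(e, c, theta))     # counts of (e_i, c_i, theta_i)
    cell = Counter(zip(c, theta))         # counts of (c_i, theta_i)
    total = 0.0
    for (a, x), size in cell.items():
        for k in range(K):
            v = joint[(k, a, x)] / size   # V_kau(e); Counter returns 0 for unseen keys
            total += abs(v - (1.0 if k == a else 0.0))
    return total

def error_rate(e, c):
    """Fraction of nodes whose label in e differs from the label in c, i.e. |e - c| / n."""
    return sum(ei != ci for ei, ci in zip(e, c)) / len(e)

# Toy example with K = 2 communities and two degree values.
c = [0, 0, 0, 1, 1, 1]
theta = [0.5, 1.5, 0.5, 1.5, 0.5, 1.5]
e = [0, 1, 0, 1, 1, 0]
lhs = error_rate(e, c)
rhs = 0.5 * v_minus_identity_l1(e, c, theta, 2)
```

Because $\hat\Pi_{au} \le 1$, the left-hand side can be much smaller than the right-hand side when there are many cells $(a, u)$; the bound is only needed in this direction.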

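Step 3: In order to prove strong consistency, we need to show that

\[ P\Big( \max_{\{e : 0 < \|V(e) - I\|_1 < \delta_n\}} F\Big( \frac{O(e)}{\mu_n}, f(e) \Big) < F\Big( \frac{O(c)}{\mu_n}, f(c) \Big) \Big) \to 1 . \tag{25} \]

Note that combining (24) and (25), we have

\[ P\Big( \max_{\{e : e \neq c\}} F\Big( \frac{O(e)}{\mu_n}, f(e) \Big) < F\Big( \frac{O(c)}{\mu_n}, f(c) \Big) \Big) \to 1 , \]

which implies the strong consistency.

Here we closely follow the derivation given in [3]. To prove (25), note that by Lipschitz continuity
and the continuity of derivatives of $F$ with respect to $V(e)$ in the neighborhood of $I$, we have

\[ F\Big( \frac{O(e)}{\mu_n}, f(e) \Big) - F\Big( \frac{O(c)}{\mu_n}, f(c) \Big) = F(\hat T(e), f(e)) - F(\hat T(c), f(c)) + \Delta(e, c) , \tag{26} \]

where $|\Delta(e, c)| \le M' (\|X(e) - X(c)\|_\infty)$, and

\[ F(T(e), f^0(e)) - F(T(c), f^0(c)) \le -C' \|V(e) - I\|_1 + o(\|V(e) - I\|_1) . \tag{27} \]

Since the derivative of $F$ is continuous with respect to $V(e)$ in the neighborhood of $I$, there exists
a $\delta'$ such that

\[ F(\hat T(e), f(e)) - F(\hat T(c), f(c)) \le -(C'/2) \|V(e) - I\|_1 + o(\|V(e) - I\|_1) \tag{28} \]

holds when $\|\hat\Pi - \Pi\|_\infty \le \delta'$. Since $\hat\Pi \to \Pi$, (28) holds with probability approaching 1. Combining
(26) and (28), it is easy to see that (25) follows if we can show

\[ P\big( \max_{\{e \neq c\}} |\Delta(e, c)| \le C' \|V(e) - I\|_1 / 4 \big) \to 1 . \tag{29} \]

Again note that $\frac{1}{n} |e - c| \le \frac{1}{2} \|V(e) - I\|_1$. So for each $m \ge 1$,

\[ P\big( \max_{|e-c| = m} |\Delta(e, c)| > C' \|V(e) - I\|_1 / 4 \big) \le P\Big( \max_{|e-c| \le m} \|X(e) - X(c)\|_\infty > \frac{C' m}{2 M' n} \Big) = I_1 . \tag{30} \]

Let $\alpha = C'/2M'$. If $\alpha \ge 6C$, by (14),

\[ I_1 \le 2 K^{m+2} n^m \exp\Big( -\alpha \frac{3m}{8n} \mu_n \Big) = 2 K^2 \big[ K \exp( \log n - 3 \alpha \mu_n / (8n) ) \big]^m . \]

If $\alpha < 6C$, by (15),

\[ I_1 \le 2 K^{m+2} n^m \exp\Big( -\alpha^2 \frac{m}{16Cn} \mu_n \Big) = 2 K^2 \big[ K \exp( \log n - \alpha^2 \mu_n / (16Cn) ) \big]^m . \]

In both cases, since $\lambda_n / \log n \to \infty$,

\[ P\big( \max_{\{e \neq c\}} |\Delta(e, c)| > C' \|V(e) - I\|_1 / 4 \big) \le \sum_{m=1}^n P\big( \max_{|e-c| = m} |\Delta(e, c)| > C' \|V(e) - I\|_1 / 4 \big) \to 0 , \]

as $n \to \infty$, which completes the proof. □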
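The role of the condition $\lambda_n / \log n \to \infty$ in the last step can be seen by evaluating the union bound numerically. The sketch below sums the bound for the case $\alpha \ge 6C$, writing $\lambda_n = \mu_n / n$; the function name and all numerical values are hypothetical, and the result is capped at 1 because it bounds a probability.

```python
import math

def step3_union_bound(n, lam, K, alpha):
    """Sum over m = 1..n of 2 K^2 [K exp(log n - 3 alpha lam / 8)]^m, capped at 1."""
    log_r = math.log(K) + math.log(n) - 3.0 * alpha * lam / 8.0
    if log_r >= 0.0:
        return 1.0            # every term is at least 2 K^2 >= 1: the bound is vacuous
    r = math.exp(log_r)
    geometric_sum = r * (1.0 - r ** n) / (1.0 - r)
    return min(1.0, 2.0 * K * K * geometric_sum)

# Expected degree growing like (log n)^2, so lambda_n / log n -> infinity.
fast = [step3_union_bound(n, math.log(n) ** 2, 2, 1.0) for n in (100, 1000, 10000)]

# Expected degree growing only like log n, with the same hypothetical constant alpha.
slow = step3_union_bound(1000, math.log(1000), 2, 1.0)
```

With $\lambda_n$ of order $\log n$ the bound can still vanish when $\alpha$ is large, but it does so for every fixed $\alpha > 0$ only when $\lambda_n / \log n \to \infty$.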

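Proof of Theorem 3.2. The regularity conditions are easy to verify. To check the key condition (*),
note that under the block model assumption, (*) becomes

(**) $F(H(S), h(S))$ is uniquely maximized over $\mathcal{S} = \{S : S \ge 0, \sum_k S_{ka} = \pi_a\}$ by $S = D$, with
$D = \mathrm{diag}(\pi)$,

where $S$ is a generic $K$ by $K$ matrix.

Up to a constant, the population version of $Q_{ERM}$ is

\[ F(H(S), h(S)) = \sum_k (H_{kk} - h_k^2 P_0) . \]

Using the identity

\[ \sum_k (H_{kk} - h_k^2 P_0) + \sum_{k \neq l} (H_{kl} - h_k h_l P_0) = \sum_{kl} H_{kl} - \Big( \sum_k h_k \Big)^2 P_0 = 0 , \]

and defining

\[ \Delta_{kl} = \begin{cases} 1 & \text{if } k = l , \\ -1 & \text{if } k \neq l , \end{cases} \]

we have

\[ F(H(S), h(S)) = \frac{1}{2} \sum_{kl} \Delta_{kl} (H_{kl} - h_k h_l P_0) = \frac{1}{2} \sum_{kl} \Delta_{kl} \Big( \sum_{ab} S_{ka} S_{lb} P_{ab} - \sum_{ab} S_{ka} S_{lb} P_0 \Big) \]
\[ = \frac{1}{2} \sum_{kl} \sum_{ab} S_{ka} S_{lb} \Delta_{kl} (P_{ab} - P_0) \le \frac{1}{2} \sum_{kl} \sum_{ab} S_{ka} S_{lb} \Delta_{ab} (P_{ab} - P_0) \]
\[ = \frac{1}{2} \sum_{ab} \Delta_{ab} \pi_a \pi_b (P_{ab} - P_0) = F(H(D), h(D)) . \]

Now it remains to show the diagonal matrix $D$ (up to a permutation) is the unique maximizer of
$F$. This follows from Lemma 3.2 in [5], since equality holds only if $\Delta_{kl} = \Delta_{ab}$ when $S_{ka} S_{lb} > 0$ and
$\Delta$ does not have two identical columns.

□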
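A small numerical scan can illustrate the claim that the diagonal matrix $D$ maximizes the population Erdős-Rényi modularity. The sketch below is only a sanity check for one hypothetical two-block example with $P_{aa} > P_0 > P_{ab}$; the function name, the matrix `P`, the proportions `pi`, and the grid are all assumptions.

```python
def erm_population(S, P):
    """Population Erdos-Renyi modularity sum_k (H_kk - h_k^2 P_0) for a K x K matrix S."""
    K = len(P)
    pi = [sum(S[k][a] for k in range(K)) for a in range(K)]        # column sums of S
    P0 = sum(pi[a] * pi[b] * P[a][b] for a in range(K) for b in range(K))
    total = 0.0
    for k in range(K):
        h_k = sum(S[k])                                             # row sum of S
        H_kk = sum(S[k][a] * S[k][b] * P[a][b] for a in range(K) for b in range(K))
        total += H_kk - h_k ** 2 * P0
    return total

# Hypothetical example: here P_0 = 0.2, so P_aa = 0.3 > P_0 > P_ab = 0.1.
P = [[0.3, 0.1], [0.1, 0.3]]
pi = [0.5, 0.5]
F_D = erm_population([[pi[0], 0.0], [0.0, pi[1]]], P)

# Scan S = [[s0, s1], [pi0 - s0, pi1 - s1]], which satisfies the column-sum constraint.
grid = []
for i in range(11):
    for j in range(11):
        s0, s1 = pi[0] * (i / 10), pi[1] * (j / 10)
        S = [[s0, s1], [pi[0] - s0, pi[1] - s1]]
        grid.append(((i, j), erm_population(S, P)))
```

On this grid the largest value is attained only at $S = D$ and at its row permutation, in line with uniqueness up to a permutation.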



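Proof of Theorem 3.1. The consistency of Newman-Girvan modularity under the block model has
already been shown in [5]. To extend this result to the degree-corrected block model, define
$\tilde S_{ka} = \sum_u x_u S_{kau}$. Then

\[ \tilde\pi_a = \sum_k \tilde S_{ka} , \]
\[ H_{kl} = \sum_{abuv} x_u x_v P_{ab} S_{kau} S_{lbv} = \sum_{ab} \tilde S_{ka} \tilde S_{lb} P_{ab} , \]
\[ H_k = \sum_l H_{kl} = \sum_{as} \tilde S_{ka} \tilde\pi_s P_{as} . \]

The population version of $Q_{NGM}$ is

\[ F(H(S)) = \sum_k \Big( \frac{H_{kk}}{\tilde P_0} - \Big( \frac{H_k}{\tilde P_0} \Big)^2 \Big) , \]

where $\tilde P_0 = \sum_{ab} \tilde\pi_a \tilde\pi_b P_{ab}$. Using the identity

\[ \sum_k \Big( \frac{H_{kk}}{\tilde P_0} - \Big( \frac{H_k}{\tilde P_0} \Big)^2 \Big) + \sum_{k \neq l} \Big( \frac{H_{kl}}{\tilde P_0} - \frac{H_k H_l}{\tilde P_0^2} \Big) = 0 , \]

we obtain

\[ F(H(S)) = \frac{1}{2} \sum_{kl} \Delta_{kl} \Big( \frac{H_{kl}}{\tilde P_0} - \frac{H_k H_l}{\tilde P_0^2} \Big) \]
\[ = \frac{1}{2} \sum_{kl} \sum_{ab} \tilde S_{ka} \tilde S_{lb} \Delta_{kl} \Big( \frac{P_{ab}}{\tilde P_0} - \frac{(\sum_s \tilde\pi_s P_{as})(\sum_t \tilde\pi_t P_{bt})}{\tilde P_0^2} \Big) \]
\[ \le \frac{1}{2} \sum_{kl} \sum_{ab} \tilde S_{ka} \tilde S_{lb} \Delta_{ab} \Big( \frac{P_{ab}}{\tilde P_0} - \frac{(\sum_s \tilde\pi_s P_{as})(\sum_t \tilde\pi_t P_{bt})}{\tilde P_0^2} \Big) \]
\[ = \frac{1}{2} \sum_{ab} \tilde\pi_a \tilde\pi_b \Delta_{ab} \Big( \frac{P_{ab}}{\tilde P_0} - \frac{(\sum_s \tilde\pi_s P_{as})(\sum_t \tilde\pi_t P_{bt})}{\tilde P_0^2} \Big) = F(H(D)) . \]

Similar to Theorem 3.2, $D$ is the unique maximizer of $F(H(\tilde S))$, so it is enough to show $S = \mathbb{D}$
whenever $\tilde S = D$ to prove uniqueness. $\tilde S = D$ implies $\tilde S_{ka} = 0$ if $k \neq a$. Since $x_u > 0$, we obtain
$S_{kau} = 0$ if $k \neq a$, which gives the result.

We note that this argument cannot be applied to prove the consistency of Erdős-Rényi modularity
under degree-corrected block models, because in that case $h_k = \sum_{au} S_{kau} = \sum_a (\sum_u S_{kau}) \neq
\sum_a \tilde S_{ka}$ when we use the transformation $\tilde S_{ka} = \sum_u x_u S_{kau}$. □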



Proof of Theorem 3.4. Up to a constant, the population version of $Q_{BL}$ is

\[ F(H(S), h(S)) = \sum_{kl} \Big( H_{kl} \log \frac{H_{kl}}{h_k h_l} - H_{kl} \Big) . \]

Let $g_{kl} = H_{kl} / (h_k h_l)$. Then

\[ F(H(S), h(S)) = \sum_{kl} (H_{kl} \log g_{kl} - h_k h_l g_{kl}) = \sum_{abkl} S_{ka} S_{lb} (P_{ab} \log g_{kl} - g_{kl}) \]
\[ \le \sum_{ab} \sum_{kl} S_{ka} S_{lb} (P_{ab} \log P_{ab} - P_{ab}) \]
\[ = \sum_{ab} (\pi_a \pi_b P_{ab} \log P_{ab} - \pi_a \pi_b P_{ab}) = F(H(D), h(D)) . \]

Since equality holds if and only if $g_{kl} = P_{ab}$ when $S_{ka} S_{lb} > 0$, uniqueness follows from
the next lemma, which is a generalization of Lemma 3.2 in [5]. □

Lemma A.2. Let $g, P, S$ be $K \times K$ matrices with nonnegative entries. Assume that

a) $P$ and $g$ are symmetric;

b) $P$ does not have two identical columns;

c) there exists at least one nonzero entry in each column of $S$;

d) for $1 \le k, l, a, b \le K$, $g_{kl} = P_{ab}$ whenever $S_{ka} S_{lb} > 0$.

Then $S$ is a diagonal matrix or a permutation of a diagonal matrix.

Proof. The proof is similar to that of Lemma 3.2 in [5].

1) Suppose there exists a permutation of the rows and columns of $S$ such that its diagonals are all positive
after permutation, i.e., $S_{bb} > 0$ for $b = 1, \dots, K$. If $S$ is not diagonal, there exists $k \neq a$ such that
$S_{ka} > 0$. For $b = 1, \dots, K$,

\[ S_{ka} S_{bb} > 0 \Rightarrow g_{kb} = P_{ab} , \]
\[ S_{kk} S_{bb} > 0 \Rightarrow g_{kb} = P_{kb} , \]
\[ \Rightarrow P_{ab} = P_{kb} . \]

This contradicts b).

2) If there does not exist such a permutation, then we can always permute rows and columns of $S$
such that for some $m \ge 1$, $S_{ij} = 0$ for $1 \le i, j \le m$, and $S_{bb} > 0$ for $b = m + 1, \dots, K$. By c), there
exists $S_{k_i i} > 0$ for $i = 1, \dots, m$ and some $k_i \in \{m + 1, \dots, K\}$. Then

\[ S_{k_i i} S_{k_i i} > 0 \Rightarrow g_{k_i k_i} = P_{ii} , \quad \text{for } i = 1, \dots, m , \]
\[ S_{k_i i} S_{k_i k_i} > 0 \Rightarrow g_{k_i k_i} = P_{i k_i} = P_{k_i i} , \quad \text{for } i = 1, \dots, m , \]
\[ \Rightarrow P_{ii} = P_{k_i i} , \quad \text{for } i = 1, \dots, m . \tag{31} \]
\[ S_{k_i i} S_{bb} > 0 \Rightarrow g_{k_i b} = P_{ib} , \quad \text{for } b = m + 1, \dots, K , \]
\[ S_{k_i k_i} S_{bb} > 0 \Rightarrow g_{k_i b} = P_{k_i b} , \quad \text{for } b = m + 1, \dots, K , \]
\[ \Rightarrow P_{ib} = P_{k_i b} , \quad \text{for } b = m + 1, \dots, K . \tag{32} \]

(31) and (32) contradict b). □



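Proof of Theorem 3.3. Up to a constant, the population version of $Q_{DCBM}$ is

\[ F(H(S)) = \sum_{kl} \Big( H_{kl} \log \frac{H_{kl}}{H_k H_l} - H_{kl} \Big) , \tag{33} \]

where we only check (**) (the form (*) takes under the block model). The generalization to the
degree-corrected block model is similar to the proof of Theorem 3.1 and is omitted.

Let $g_{kl} = H_{kl} / (H_k H_l)$, and

\[ F(H(S)) = \sum_{kl} (H_{kl} \log g_{kl} - H_k H_l g_{kl}) \]
\[ = \sum_{kl} \Big[ \sum_{ab} S_{ka} S_{lb} P_{ab} \log g_{kl} - \Big( \sum_{as} S_{ka} \pi_s P_{as} \Big) \Big( \sum_{bt} S_{lb} \pi_t P_{tb} \Big) g_{kl} \Big] \]
\[ = \sum_{kl} \sum_{ab} S_{ka} S_{lb} \Big[ P_{ab} \log g_{kl} - \Big( \sum_s \pi_s P_{as} \Big) \Big( \sum_t \pi_t P_{tb} \Big) g_{kl} \Big] . \]

Since $\arg\max_x (c_1 \log x - c_2 x) = c_1 / c_2$, replacing $g_{kl}$ by

\[ \frac{P_{ab}}{(\sum_s \pi_s P_{as})(\sum_t \pi_t P_{tb})} \]

we obtain

\[ F(H(S)) \le \sum_{kl} \sum_{ab} S_{ka} S_{lb} \Big[ P_{ab} \log \frac{P_{ab}}{(\sum_s \pi_s P_{as})(\sum_t \pi_t P_{tb})} - P_{ab} \Big] \]
\[ = \sum_{ab} \Big[ \pi_a \pi_b P_{ab} \log \frac{P_{ab}}{(\sum_s \pi_s P_{as})(\sum_t \pi_t P_{tb})} - \pi_a \pi_b P_{ab} \Big] = F(H(D)) . \]

□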
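The upper bound derived in this proof can be checked numerically for a small example. The sketch below evaluates the population criterion (33) under the block model over a grid of matrices $S$ satisfying the column-sum constraint and compares each value with $F(H(D))$. The function name, the matrix `P`, and the proportions `pi` are hypothetical, and terms with $H_{kl} = 0$ are treated as zero.

```python
import math

def dcbm_population(S, P):
    """Population criterion sum_kl (H_kl log(H_kl / (H_k H_l)) - H_kl) under the block model."""
    K = len(P)
    H = [[sum(S[k][a] * S[l][b] * P[a][b] for a in range(K) for b in range(K))
          for l in range(K)] for k in range(K)]
    H_row = [sum(row) for row in H]                  # H_k = sum_l H_kl
    total = 0.0
    for k in range(K):
        for l in range(K):
            if H[k][l] > 0.0:                        # convention: 0 * log 0 = 0
                total += H[k][l] * math.log(H[k][l] / (H_row[k] * H_row[l])) - H[k][l]
    return total

# Hypothetical example with unbalanced communities and no constraint on P.
P = [[0.3, 0.1], [0.1, 0.2]]
pi = [0.3, 0.7]
F_D = dcbm_population([[pi[0], 0.0], [0.0, pi[1]]], P)

# Scan S = [[s0, s1], [pi0 - s0, pi1 - s1]] on a grid.
values = []
for i in range(11):
    for j in range(11):
        s0, s1 = pi[0] * (i / 10), pi[1] * (j / 10)
        S = [[s0, s1], [pi[0] - s0, pi[1] - s1]]
        values.append(dcbm_population(S, P))
```

No grid point exceeds `F_D`, which matches the inequality above; the scan does not by itself establish uniqueness of the maximizer.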